316        Bioinformatics

-2 fastq_pure/ERR1823587_pure_R2-50.fastq.gz \
--only-assembler \
--threads 4 \
--memory 16 \
--phred-offset 33 \
-k 51

mkdir metag_moderate
metaspades.py \
-o metag_moderate \
-1 fastq_pure/ERR1823601_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823601_pure_R2-50.fastq.gz \
--only-assembler \
--threads 4 \
--memory 16 \
--phred-offset 33 \
-k 51

mkdir metag_severe
metaspades.py \
-o metag_severe \
-1 fastq_pure/ERR1823608_pure_R1-50.fastq.gz \
-2 fastq_pure/ERR1823608_pure_R2-50.fastq.gz \
--only-assembler \
--threads 4 \
--memory 16 \
--phred-offset 33 \
-k 51
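
The three assemblies use the same options and differ only in the output directory and the accession number, so the commands can be generated from a small table. The following is a minimal sketch, not part of the original workflow: the function names are hypothetical, and the pairing of “metag_healthy” with ERR1823587, as well as the R1 file name for that sample, is assumed from the pattern of the other two samples. The function only prints the commands so that they can be inspected first.

```shell
# Print (do not run) the assembly command for one sample.
# $1 = output directory, $2 = run accession.
metaspades_cmd() {
    printf 'mkdir -p %s && metaspades.py -o %s -1 fastq_pure/%s_pure_R1-50.fastq.gz -2 fastq_pure/%s_pure_R2-50.fastq.gz --only-assembler --threads 4 --memory 16 --phred-offset 33 -k 51\n' \
        "$1" "$1" "$2" "$2"
}

# Print the commands for all three samples.
# Assumption: metag_healthy corresponds to ERR1823587.
all_metaspades_cmds() {
    while read -r outdir acc; do
        metaspades_cmd "$outdir" "$acc"
    done <<'EOF'
metag_healthy ERR1823587
metag_moderate ERR1823601
metag_severe ERR1823608
EOF
}

all_metaspades_cmds
```

Once the printed commands look right, piping the output to “sh” (all_metaspades_cmds | sh) would run the three assemblies one after another.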

Run “metaspades.py --help” to read about the usage and options of this program.

Several files are produced in each of the output directories “metag_healthy”, “metag_moderate”, and “metag_severe”. The files that contain the assembled sequences are “contigs.fasta” and “scaffolds.fasta”. Contigs are built from overlapping reads. The contigs are then ordered, oriented, and connected, with the gaps between them filled with Ns, to form the scaffolds. The “K51” directory contains the individual result files for the assembly with 51-mers. When several k-mer sizes are used, there is one K directory per size, and the best assembled sequences are the ones stored outside these K directories, at the top level of the output directory. The directory “misc” contains broken scaffolds.
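
A quick way to compare the three assemblies is to count the sequences in each FASTA file, since every record starts with a “>” header line. The sketch below assumes the output directories described above exist in the current directory; the helper name “count_seqs” is made up for illustration, and files that are missing are simply skipped.

```shell
# Count FASTA records in a file: each record begins with a '>' header line.
count_seqs() {
    grep -c '^>' "$1"
}

# Report contig and scaffold counts for every assembly that is present.
for d in metag_healthy metag_moderate metag_severe; do
    for f in contigs.fasta scaffolds.fasta; do
        if [ -f "$d/$f" ]; then
            printf '%s/%s\t%s\n' "$d" "$f" "$(count_seqs "$d/$f")"
        fi
    done
done
```

Because scaffolds are made by joining contigs, the scaffold count for a sample would normally be no larger than its contig count.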

The file with the “.gfa” extension is in Graphical Fragment Assembly (GFA) format, in which sequences are represented by lines starting with “S” and the overlaps between sequences by lines starting with “L”, as shown in Figure 8.3. The plus (+) and minus (-) signs indicate whether the overlapping sequence is the original or its reverse complement. A value of the form “XM” in a link gives the overlap length, where X is the number of overlapping bases.
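
The line types described above can be tallied with a few lines of awk. This is a rough sketch that assumes a tab-separated GFA file in which the overlap is the sixth field of each “L” line; the function name “gfa_summary” is hypothetical.

```shell
# Summarize a GFA file: number of segments (S lines), number of links
# (L lines), and how often each overlap value (field 6 of L lines) occurs.
gfa_summary() {
    awk -F '\t' '
        $1 == "S" { segments++ }
        $1 == "L" { links++; overlaps[$6]++ }
        END {
            printf "segments\t%d\n", segments
            printf "links\t%d\n", links
            for (o in overlaps) printf "overlap %s\t%d\n", o, overlaps[o]
        }
    ' "$1"
}

# Example call (uncomment once an assembly has finished):
# gfa_summary metag_healthy/assembly_graph_with_scaffolds.gfa
```

If more than one overlap value is present, the order in which the “overlap” lines are printed is not defined by awk.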

Thus, the file “assembly_graph_with_scaffolds.gfa” generated by metaSPAdes is the GFA file that represents the final assembly of the metagenomes in the sample. SPAdes builds this assembly graph from k-mers formed from the reads (vertices) and their overlaps (edges). The assembler then resolves paths across the vertices of the graph and outputs non-branching paths as contigs.
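
To make the notion of k-mers concrete, the toy sketch below lists every k-mer of a single sequence. It only illustrates the idea and is not how SPAdes is implemented; the function name “kmers” is made up. A sequence of length L yields L - k + 1 k-mers, and each consecutive pair shares k - 1 bases.

```shell
# List all k-mers of a sequence, one per line.
# $1 = sequence, $2 = k
kmers() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(s); i++)
            print substr(s, i, k)
    }'
}

# Example call with a short made-up sequence and k = 3:
kmers ACGTAC 3
```

With the option “-k 51” used above, a read shorter than 51 bases would yield no k-mers at all in this sketch.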